Personalizing Web Publishing via Information Extraction
Abstract
Although Web publishing is increasingly successful, it still requires too much time and effort to precisely locate specific information, because Web search and navigation are still underdeveloped. This process is often tied to traditional solutions developed outside the Web scenario, for example, information retrieval (IR) models applied over hypertext rather than simple text documents. Moreover, even during navigation, users have limited freedom because the links Web sites provide express the information providers' perspective rather than the users' information needs. This means Web content is still underused. For example, searching portals managed according to human-centered policies still relies on traditional (that is, non-Web-inspired) technologies. We cannot easily personalize navigation paths according to the users' information needs as long as specialized professionals are charged with the design, content, and authoring.

News agencies offer a meaningful example of this IR problem. Most newspaper portals deliver news streams by enrichment and topical categorization. News Web sites also associate previously released news items with new items as related texts. Related news links aim to increase the browsing capability and to speed up the search process. These news items are usually found by a combination of search methods that involve keyword matching and text clustering. The related news section generally includes news items that deal with the same people (such as George W. Bush, Stanley Kubrick, or Ronaldo), events (such as the Venezia Cinema Festival), or locations (such as Pakistan, Seattle, or Salt Lake City) as the initial article. However, this bag-of-words model does not provide any explicit (logically meaningful) representation of the people or events justifying the links. Related news lists are usually accepted by editorial boards, as in the case of the Financial Times, where the related stories are called Editor's Choice. The cost of this phase depends strictly on the accuracy of the applied retrieval methods. This is a critical issue because there is no meaningful way to judge a link without an explanation; consequently, it is difficult for end users to know whether a related link will interest them. The method is also subject to obvious bias in the delivery process (editors might emphasize material according to strategic, economic, or political agendas), and users are left with no real freedom. Although user profiles can explicitly describe user interests, there is no way to trace inferences over the information the technology offers.

We suggest a new scenario. Imagine a Web-publishing portal where news is categorized and enriched by extracting explicit representations of its content (hereafter, objective representation [OR]), links among news items exist according to their ORs, and users can explicitly describe their information needs in conceptual profiles that constrain news categories and links. The main idea is that different classes of users (for example, company managers versus journalists) can select similar categories via different linking policies. When link types differ, a portal can make different hypertext documents available to the related user classes. The Namic system,1 a multilingual text classification system, supports this scenario.
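To make the proposed scenario concrete, here is a minimal Python sketch; it is not the Namic implementation, and all names (ObjectiveRepresentation, NewsItem, link_types, PROFILES, related_news) are hypothetical. It assumes each news item carries an extracted OR listing people, events, and locations; two items are linked only when their ORs overlap, the link is labeled by what they share, and a conceptual profile for a user class filters which link types that class sees.

from dataclasses import dataclass, field

@dataclass
class ObjectiveRepresentation:
    """Hypothetical explicit content representation extracted from one news item."""
    people: set = field(default_factory=set)     # e.g. {"George W. Bush"}
    events: set = field(default_factory=set)     # e.g. {"Venezia Cinema Festival"}
    locations: set = field(default_factory=set)  # e.g. {"Salt Lake City"}

@dataclass
class NewsItem:
    title: str
    orep: ObjectiveRepresentation

def link_types(a: NewsItem, b: NewsItem) -> set:
    """Link two items only when their ORs overlap; label the link by what they share."""
    types = set()
    if a.orep.people & b.orep.people:
        types.add("same-person")
    if a.orep.events & b.orep.events:
        types.add("same-event")
    if a.orep.locations & b.orep.locations:
        types.add("same-location")
    return types

# Conceptual profiles: each user class admits only some link types (illustrative values).
PROFILES = {
    "company manager": {"same-event", "same-location"},
    "journalist": {"same-person", "same-event"},
}

def related_news(item: NewsItem, corpus: list, user_class: str) -> list:
    """Return (related title, explanation) pairs admitted by the user's conceptual profile."""
    allowed = PROFILES[user_class]
    related = []
    for other in corpus:
        if other is item:
            continue
        shared = link_types(item, other) & allowed
        if shared:
            related.append((other.title, sorted(shared)))
    return related

# Example: the same corpus yields different hypertexts for different user classes.
a = NewsItem("Summit opens", ObjectiveRepresentation(people={"George W. Bush"}, locations={"Seattle"}))
b = NewsItem("Seattle prepares for protests", ObjectiveRepresentation(locations={"Seattle"}))
print(related_news(a, [a, b], "company manager"))  # [('Seattle prepares for protests', ['same-location'])]
print(related_news(a, [a, b], "journalist"))       # []

Unlike the bag-of-words baseline described above, every proposed link here carries an explicit justification (same-person, same-event, or same-location), so the portal can explain why an item is related and can vary the resulting hypertext per user class.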
Similar Resources
Managing Web Sites with OntoWebber
OntoWebber is a system for creating and managing data-intensive Web sites. It aims to reduce the effort of publishing data as static and dynamic Web pages, personalizing the user experience of browsing and navigating the data, and maintaining the Web site as well as the underlying data. Based on a domain ontology and a site-modeling ontology, site views on the underlying data are constructed as...
Unstable Markup: A Template-Based Information Extraction from Web Sites with Unstable Markup
This paper presents results of a work on crawling the CEUR Workshop proceedings web site to a Linked Open Data (LOD) dataset in the framework of the Semantic Publishing Challenge 2014. Our approach is based on so-called "templates of web site blocks" and DBpedia for crawling and linking extracted entities.
Personalizing the Web for Multilingual Web Sources
When browsing information on large Web sites, users often receive too much irrelevant information. The WWW Information Collection, Collaging, and Programming (Wiccap) system lets ordinary users build personalized Web views, which let them see only the information they want — and in the way they prefer. It provides a set of GUI tools, including a mapping wizard, extraction agent, and presentatio...
A Template-Based Information Extraction from Web Sites with Unstable Markup
This paper presents results of a work on crawling the CEUR Workshop proceedings web site to a Linked Open Data (LOD) dataset in the framework of the ESWC 2014 Semantic Publishing Challenge. Our approach is based on using an extensible template-dependent crawler and DBpedia for linking extracted entities, such as the names of universities and countries.
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Journal: IEEE Intelligent Systems
Volume 18, Issue -
Pages: -
Year of publication: 2003